class: center, middle, inverse, title-slide

# Spatial Data Workshop: Working with Census Data and Introduction to Map Making
### Josemari Feliciano
### USDA Rural Development - Innovation Center
### Yale University - Department of Biostatistics
### 04/06/2022

---

<style type="text/css">
body {
  font-size: 12px;
}
td {
  font-size: 12px;
}
code.r{
  font-size: 12px;
}
.remark-code, .remark-inline-code {
  font-size: 90%;
}
</style>

# Workshop Goals

1. Introduce you to various geographic boundaries (e.g., counties, tracts, block groups) we work with here in the US.
2. Learn how to properly define rural across various geographic boundaries.
3. Introduction to Census Geocoding tools using (a) their web interface and (b) the censusxy package.
4. Provide an overview of various datasets offered by the US Census Bureau.
5. Provide a detailed introduction to the American Community Survey (ACS) data.
6. Learn how to use R packages (e.g., censusapi, tidycensus) to seamlessly download and work with ACS data.
7. Learn the basics of static map making using ggplot2 and sf.

---

# Before we continue:

</br>

Let us pause for a minute or two before we continue with the workshop.

</br>

Go to: https://api.census.gov/data/key_signup.html

- Sign up for a quick API key from the Census.

We will be using the following R packages later: censusxy, censusapi, tigris, dplyr, ggplot2, and tidycensus.

---

## Geographic Identifiers (GEOIDs): The Basics.

__Geographic identifiers (or GEOIDs)__ are numeric codes that uniquely identify all administrative/legal and statistical geographic areas.

- Without a common identifier among geographic and demographic datasets, researchers and other stakeholders would have a difficult time pairing the appropriate demographic data with the appropriate geographic data, thus considerably increasing data processing times and the likelihood of data inaccuracy.

Here in the US, we _primarily_ use what are called __Federal Information Processing Series (FIPS) codes__.
- Many US-based datasets label the relevant code column in their geographic and demographic data as either GEOID or FIPS. Datasets use GEOID and FIPS interchangeably.
- If you are working with spatial data, it is best to have the FIPS code to easily merge the datasets.

---

### Geographic hierarchies

<img src="data:image/png;base64,#geography_level.PNG" width="60%" style="display: block; margin: auto;" />

Typically, the key geographic levels scientists and policy makers concern themselves with are: (1) State, (2) County, (3) Census Tract, (4) Census Block, and (5) Zip Code Tabulation Areas (ZCTAs).

---

## Geographic hierarchies

<img src="data:image/png;base64,#geographic.png" width="80%" style="display: block; margin: auto;" />

---

## Federal Information Processing Standards (FIPS)

<img src="data:image/png;base64,#GEOIDStructure.PNG" width="55%" style="display: block; margin: auto;" />

Again, GEOID/FIPS codes are typically what we use to identify both the geographic level and the specific location we are working with. FIPS and GEOID are often used synonymously.

---

## State FIPS

You can get this from many websites. This specific list is from [Census](https://www2.census.gov/geo/docs/reference/state.txt).
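
Because a GEOID is built by concatenating FIPS codes, a longer GEOID can be split back into its parts. The sketch below assumes the standard 11-digit tract GEOID layout (2-digit state, 3-digit county, 6-digit tract). The helper name `split_tract_geoid` and the example value are made up for illustration.

```r
# Hypothetical helper: split an 11-digit tract GEOID into its FIPS components.
# Assumes the standard layout: 2-digit state + 3-digit county + 6-digit tract.
split_tract_geoid <- function(geoid) {
  geoid <- as.character(geoid)
  stopifnot(all(nchar(geoid) == 11))
  data.frame(
    state_fips  = substr(geoid, 1, 2),
    county_fips = substr(geoid, 1, 5),  # state + county = county-level GEOID
    tract_code  = substr(geoid, 6, 11),
    stringsAsFactors = FALSE
  )
}

# Made-up example value that starts with Missouri's state code (29).
parts <- split_tract_geoid("29510127000")
print(parts)
```

Keeping the GEOID as a character string (not a number) is what makes this kind of slicing, and later merging, reliable.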
---

## A Quick Detour: Non-Census Data at County and Tract Level

Federal agencies and researchers are increasingly using the CDC/ATSDR __Social Vulnerability Index (SVI)__.

__From CDC:__ "Natural disasters and infectious disease outbreaks can pose a threat to a community’s health. Socially vulnerable populations are especially at risk during public health emergencies because of factors like socioeconomic status, household composition, minority status, or housing type and transportation."

__SVI Availability:__ Data are available at the county and tract levels.

__Index Range:__ The index (labelled RPL_THEMES in the dataset) is a score between 0 (least vulnerable) and 1 (most vulnerable).

---

## SVI Components

The latest SVI data is for 2018. The dataset calculates the SVI using Census ACS data (more on this later) for 2014-2018. Click [here](https://www.atsdr.cdc.gov/placeandhealth/svi/documentation/SVI_documentation_2018.html) for detailed SVI 2018 documentation.

<img src="data:image/png;base64,#CDC-SVI-Variables.jpg" width="60%" style="display: block; margin: auto;" />

---

### Partial SVI Data at County-Level:
---

### Partial SVI Data at Tract-Level:
---

## How do we define rural geographically in the US?

Major definitions we can use:

- __Office of Management and Budget (OMB) Definition:__ County-level.
- __USDA Rural-Urban Continuum Codes (RUCC):__ County-level.
- __USDA Rural-Urban Commuting Area (RUCA):__ Tract-level.

There are others that are somewhat difficult to work with (Census, Frontier and Remote Area Codes).

Note: RUCA is also available at the zip code level (but I personally do not recommend this, as zip codes can change drastically across years). There is also another issue with postal zip codes vs zip code tabulation areas (ZCTAs) that others might find confusing (more on this later).

---

## Rural Definition: OMB

The OMB definition deals with another geographic level (core-based statistical areas). These statistical areas are composed of one or more counties.

<img src="data:image/png;base64,#omb_map.gif" width="90%" style="display: block; margin: auto;" />

---

## Rural Definition: OMB

Under the OMB definition, rural counties are those that are not part of a metropolitan area (i.e., micropolitan and non-core counties).

The March 2020 core-based statistical area data below is from [this Census dataset](https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html).
---

## Rural Definition: OMB

The OMB dataset is not very user friendly or "clean". Here is my processed, clean list of rural/non-rural counties:
---

## Rural Definition: Rural-Urban Continuum Codes (RUCC)

- Dataset created by the USDA Economic Research Service (ERS).
- Goes beyond a binary classification: rural (non-metro) vs non-rural (metro).
- Latest dataset is from 2013. Next data iteration should be out in 2023.
- RUCC codes distinguish metropolitan (metro) counties by the population size of their metro area, and nonmetropolitan (nonmetro) counties by degree of urbanization and adjacency to metro areas.
- Full documentation and dataset can be accessed by clicking [here](https://www.ers.usda.gov/data-products/rural-urban-continuum-codes/).

---

## Rural Definition: Rural-Urban Continuum Codes (RUCC)

Each county is assigned a RUCC between 1 and 9.

RUCC Codes for Metro (non-rural) counties:

- 1: Counties in metro areas of 1 million population or more
- 2: Counties in metro areas of 250,000 to 1 million population
- 3: Counties in metro areas of fewer than 250,000 population

RUCC Codes for Non-metro (rural) counties:

- 4: Urban population of 20,000 or more, adjacent to a metro area
- 5: Urban population of 20,000 or more, not adjacent to a metro area
- 6: Urban population of 2,500 to 19,999, adjacent to a metro area
- 7: Urban population of 2,500 to 19,999, not adjacent to a metro area
- 8: Completely rural or less than 2,500 urban population, adjacent to a metro area
- 9: Completely rural or less than 2,500 urban population, not adjacent to a metro area

---

## Full RUCC Dataset
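
Since RUCC codes 1-3 are metro and 4-9 are nonmetro, a binary rural flag can be derived with a single comparison. The sketch below is only illustrative: the data frame, its column names (`FIPS`, `RUCC_2013`), and the values are assumptions, not rows from the ERS file.

```r
# Hypothetical county-level RUCC data; the FIPS codes and values are made up.
rucc <- data.frame(
  FIPS = c("29001", "29003", "29005"),
  RUCC_2013 = c(7, 1, 9),
  stringsAsFactors = FALSE
)

# Codes 1-3 are metro (non-rural); codes 4-9 are nonmetro (rural).
rucc$rural <- rucc$RUCC_2013 >= 4
rucc$metro_status <- ifelse(rucc$rural, "Nonmetro (rural)", "Metro (non-rural)")
print(rucc)
```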
---

## Rural Definition: Rural-Urban Commuting Area (RUCA)

- Dataset created by the USDA Economic Research Service (ERS).
- Dataset is from 2010.
- RUCA classification is based on population density, urbanization, and daily commuting to identify urban cores and adjacent territory at the tract level.
- Data available at the (1) tract and (2) zip code levels.
- Very technical methodology. For data and documentation, click [here](https://www.ers.usda.gov/data-products/rural-urban-commuting-area-codes/) if you want to learn more.

---

## Rural Definition: Rural-Urban Commuting Area (RUCA)

Each census tract is assigned a Primary RUCA code between 1 and 10 (99 if not coded).

Primary RUCA Codes:

- 1 - Metropolitan area core: primary flow within an urbanized area (UA)
- 2 - Metropolitan area high commuting: primary flow 30% or more to a UA
- 3 - Metropolitan area low commuting: primary flow 10% to 30% to a UA
- 4 - Micropolitan area core: primary flow within an Urban Cluster of 10,000 to 49,999 (large UC)
- 5 - Micropolitan high commuting: primary flow 30% or more to a large UC
- 6 - Micropolitan low commuting: primary flow 10% to 30% to a large UC
- 7 - Small town core: primary flow within an Urban Cluster of 2,500 to 9,999 (small UC)
- 8 - Small town high commuting: primary flow 30% or more to a small UC
- 9 - Small town low commuting: primary flow 10% to 30% to a small UC
- 10 - Rural areas: primary flow to a tract outside a UA or UC
- 99 - Not coded: Census tract has zero population and no rural-urban identifier information

---

## Sample RUCA Tract Data

Only sample data (N=10) is shown here as there are 70,000+ tracts. The full RUCA code dataset and documentation can be accessed [here](https://www.ers.usda.gov/data-products/rural-urban-commuting-area-codes/).
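
For analysis, the primary RUCA codes are often easier to handle once collapsed into the four broad labels listed earlier (metropolitan, micropolitan, small town, rural). The sketch below shows one possible way to do that; the input vector is made up, and treating code 99 as missing is my assumption, not an ERS rule.

```r
# Hypothetical primary RUCA codes for a handful of tracts (values are made up).
primary_ruca <- c(1, 3, 4, 8, 10, 99)

# Group codes by their broad label; 99 (not coded) is treated as missing.
ruca_group <- cut(
  ifelse(primary_ruca == 99, NA_real_, primary_ruca),
  breaks = c(0, 3, 6, 9, 10),  # intervals: (0,3], (3,6], (6,9], (9,10]
  labels = c("Metropolitan", "Micropolitan", "Small town", "Rural")
)
print(data.frame(primary_ruca, ruca_group))
```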
---

## Free geocoding tools

You likely know which county you are in. However, you likely do not know the specific census tract or block group you are in.

Luckily, Census has free geocoding tools (e.g., [web interface](https://geocoding.geo.census.gov/), API) that can help us!

The censusxy package (an API wrapper) in R provides an easy way to geocode your data that works with the Census Geocoding API.

Let us go over basic geocoding examples using the censusxy package.

---

## The censusxy package: retrieving longitude and latitude.

__Code for finding coordinates for an address.__

```r
library(censusxy)

cxy_single('1600 Pennsylvania Avenue NW', 'Washington', 'DC', 20500)
# cxy_single('1600 Pennsylvania Avenue NW', 'Washington', 'DC') will also work!
```

</br>

__Returned output by API call:__
</br>Here, coordinates.x is the __longitude__, while coordinates.y is the __latitude__.

---

## The censusxy package: retrieving geographies.

Earlier, we learned that the longitude and latitude for the White House (1600 PENNSYLVANIA AVE NW, WASHINGTON, DC, 20500) are -77.03534 and 38.89875, respectively.

__Code for finding geographies for coordinates:__

```r
cxy_geography(-77.03534, 38.89875)
```

</br>

__Note:__ The code above will return a data frame with 164 columns. The output below is a subset for specific GEOID variables I wanted (run the code yourself to see all 164 columns).

__Partial Output:__
---

## List of Census Surveys and Datasets

</br></br></br>

.pull-left[
The US Census Bureau conducts 130+ surveys each year. A detailed list can be accessed by clicking [this](https://www.census.gov/programs-surveys/surveys-programs.html).

Let us quickly explore this list using a web browser.
]

.pull-right[
<img src="data:image/png;base64,#census_list.png" width="100%" style="display: block; margin: auto;" />
]

---

## Quick overview of Small Area Health Insurance Estimates (SAHIE)

.pull-left[
You may access the large yearly SAHIE datasets by clicking [this](https://www2.census.gov/programs-surveys/sahie/datasets/time-series/estimates-acs/).

__Recommended:__ Access the data using the SAHIE interactive tool, which can be reached by clicking [this](https://www.census.gov/data-tools/demo/sahie/#/). This greatly minimizes data cleaning/subsetting tasks.
]

.pull-right[
<img src="data:image/png;base64,#sahie_tool.png" width="100%" style="display: block; margin: auto;" />
]

---

## Quick overview of Longitudinal Employer-Household Dynamics (LEHD) Datasets

- Job-to-Job Flows (J2J) for job mobility statistics.
- LEHD Origin-Destination Employment Statistics (LODES) for employment data based on locations (residence, workplace). __I have worked with this extensively, so some examples are on the next page.__
- Post-Secondary Employment Outcomes (PSEO) for statistics on the earnings and employment outcomes of graduates of select post-secondary institutions in the United States.
- Quarterly Workforce Indicators (QWI) for economic indicators including employment, job creation, earnings, and other measures of employment flows.
- Specific documentation on LEHD datasets can be accessed [here](https://lehd.ces.census.gov/data/).

---

## Quick demo of the LEHD LODES datasets

- LODES data is available at the census block level.
- In the dataset that I will demonstrate, I tallied the employment data at the census tract level.
- I will quickly show you via an Excel workbook what clean LODES datasets at the census tract level look like.

---

## The American Community Survey (ACS) Data

The American Community Survey (ACS) is an ongoing yearly survey.

__Census Bureau:__ "[ACS] is the premier source for detailed population and housing information about our nation."

Two yearly versions: __ACS 1-year__ and __ACS 5-year.__

- Note: For older data (2007-2013), 3-year estimates exist.
- __ACS 1-year estimates data__ for areas with populations of 65,000+.
- __ACS 5-year estimates data__ for all areas regardless of population size.
- Many datasets provided by other federal agencies are subsets of, or are created in part using, ACS data.

__Language:__ If someone says they're using the 5-year 2020 ACS data, they're referring to the 2016-2020 5-year ACS data.

---

## The American Community Survey (ACS) Data

There is a lot of ACS-related documentation online. __What to look for and remember:__ Table Shells.

Table shells (particularly for detailed tables) provide a comprehensive list of the variables in the ACS data.

- Table shells are provided yearly. If you are doing a longitudinal study using ACS data (e.g., a 10-year study), it might be a good idea to check the relevant yearly table shells to see if variables of interest are available for all years.
- Let us go over the 2019 ACS Table Shell I uploaded in [my github repo](https://github.com/neonseri/RLadiesCensusSpatialWorkshop/blob/main/ACS2019_Table_Shells.xlsx).
- Alternative ACS documentation: API documentation also lists all the available ACS variables [here](https://api.census.gov/data/2019/acs/acs5/variables.html).

__Task:__ Let us spend a minute or two to see which variables may be included. The variables included might surprise you. Let us look up 'poverty', 'internet' and 'computer' as examples to determine the comprehensiveness of the ACS data.
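
The same keyword lookup can also be done in R. The sketch below assumes the variable metadata is a data frame with `name`, `label`, and `concept` columns, which is the shape I would expect from `censusapi::listCensusMetadata()`. The helper `search_acs_vars` is hypothetical, and the small stand-in table uses abbreviated, illustrative labels instead of text copied from the API.

```r
# Hypothetical helper: case-insensitive keyword search over variable metadata.
# Assumes `meta` has `name`, `label`, and `concept` columns.
search_acs_vars <- function(meta, keyword) {
  hit <- grepl(keyword, meta$label, ignore.case = TRUE) |
    grepl(keyword, meta$concept, ignore.case = TRUE)
  meta[hit, c("name", "label", "concept")]
}

# With an API key set, the real metadata could be fetched along these lines:
# meta <- censusapi::listCensusMetadata(name = "acs/acs5", vintage = 2019,
#                                       type = "variables")

# Tiny stand-in table so the sketch runs offline (labels are illustrative).
meta <- data.frame(
  name = c("B28002_001E", "B28002_002E", "B25071_001E"),
  label = c("Estimate!!Total",
            "Estimate!!Total!!With an Internet subscription",
            "Estimate!!Median gross rent as a percentage of household income"),
  concept = c("PRESENCE AND TYPES OF INTERNET SUBSCRIPTIONS IN HOUSEHOLD",
              "PRESENCE AND TYPES OF INTERNET SUBSCRIPTIONS IN HOUSEHOLD",
              "MEDIAN GROSS RENT AS A PERCENTAGE OF HOUSEHOLD INCOME"),
  stringsAsFactors = FALSE
)

internet_vars <- search_acs_vars(meta, "internet")
print(internet_vars$name)
```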
---

### Research scenario and the censusapi package

.pull-left[
We are researchers trying to understand the role of __household internet access__ in influencing health insurance coverage at the county level.

As an initial try, let us get the number of households in every county in the US.

_Partial data is displayed below_
]

.pull-right[
Initial setup:

```r
# If you don't have this package,
# run: install.packages("censusapi")
library(censusapi)

Sys.setenv(CENSUS_KEY="YOUR API KEY HERE")

household_data <- getCensus(
  name = "acs/acs5",        # requests ACS5 data
  vintage = 2019,           # requests 2019 data
  vars = c("B28002_001E"),  # requests variable(s)
  region = "county:*")      # requests geography
```
]
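
The county-level result comes back with separate `state` and `county` code columns, so a 5-digit county GEOID has to be assembled before merging with other county datasets. The sketch below is a minimal illustration that assumes those two columns are returned as character strings with leading zeros; `household_sample` and its counts are made up stand-ins for the real output.

```r
# Stand-in for a few rows of the API result; the counts are made up.
# Assumption: `state` and `county` arrive as zero-padded character columns.
household_sample <- data.frame(
  state = c("09", "29"),
  county = c("001", "510"),
  B28002_001E = c(1000, 2000),
  stringsAsFactors = FALSE
)

# Concatenate the state and county codes into the 5-digit county GEOID.
household_sample$GEOID <- paste0(household_sample$state, household_sample$county)
print(household_sample)
```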
---

## Examples of Other Important censusapi calls

An example of asking for multiple variables:

```r
data1 <- getCensus(name = "acs/acs5",
                   vintage = 2019,
                   vars = c("B28002_001E", "B28002_002E"),
                   region = "county:*")
```

An example of asking for Connecticut-only county-level data (Note: CT's FIPS code is 09).

```r
data2 <- getCensus(name = "acs/acs5",
                   vintage = 2019,
                   vars = c("B28002_001E", "B28002_002E"),
                   region = "county:*",
                   regionin = "state:09")
```

An example of asking for Missouri-only tract-level data (Note: MO's FIPS code is 29).

```r
data3 <- getCensus(name = "acs/acs5",
                   vintage = 2019,
                   vars = c("B28002_001E", "B28002_002E"),
                   region = "tract:*",
                   regionin = "state:29")
```

__Note:__ If you want to fetch 1-year estimates only, simply change _name_ from acs/acs5 to acs/acs1.

---

## The recipe: Making Maps the Simple Way

</br>

1. We need a shapefile, which is a digital format for storing geographic location and associated attribute information. We will use the tigris package to get the data directly from the US Census Bureau.
2. We need the data we are interested in mapping.
3. Merge the data (e.g., SVI data) into the shapefile (which is also a data frame).
4. Leverage ggplot2 to render the map.

__For this example:__ We will map the SVI data from the CDC for Missouri. You may download the relevant file from [my Github repo](https://github.com/neonseri/RLadiesCensusSpatialWorkshop/blob/main/MissouriSVI.csv) (or you may download the csv from the next page).

---

## Quick data look at MO's Tract-level SVI Data
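
One detail to watch in files like this: when a CSV is read into R, FIPS codes are usually parsed as numbers, which silently drops a leading zero (Missouri's 29 is unaffected, but Connecticut's 09 is not). The sketch below shows one way to restore the 11-digit tract GEOID; the two FIPS values are made up for illustration.

```r
# Hypothetical tract FIPS codes after being parsed as numbers:
# the first (state 09) has lost its leading zero, the second (state 29) has not.
fips_numeric <- c(9001010101, 29510127000)

# Zero-pad back to the 11 characters used by tract-level GEOIDs.
geoid <- sprintf("%011.0f", fips_numeric)
print(geoid)

# Another option is to avoid the numeric parse in the first place, e.g.
# read.csv(path, colClasses = c(FIPS = "character")), where `path` is your file.
```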
---

## Creating SVI Map for MO

__Step 1:__ Retrieve shapefile needed.

```r
library(tigris)

mo_shape_file <- tracts(state = "MO")
```

__Step 2:__ Load MO SVI data into R.

```r
library(dplyr)

# Note: Need to coerce to character as the GEOID from the shapefile is character type.
mo_svi_data <- read.csv("~/SpatialData/MissouriSVI.csv") %>%
  mutate(GEOID = as.character(FIPS))

# In Base R:
# mo_svi_data <- read.csv("~/SpatialData/MissouriSVI.csv")
# mo_svi_data$GEOID <- as.character(mo_svi_data$FIPS)
```

__Step 3:__ Merge SVI data to shapefile.

```r
# Join on the shared GEOID column explicitly
mo_shape_file_v2 <- left_join(mo_shape_file, mo_svi_data, by = "GEOID")

# In Base R: merge(x=mo_shape_file,y=mo_svi_data,by="GEOID", all.x=TRUE)
```

__Step 4:__ Plot map (Note: RPL_THEMES is the SVI variable).

```r
library(ggplot2)

ggplot(data = mo_shape_file_v2) +
  geom_sf(aes(fill = RPL_THEMES))
```

---

## Generating Map

.pull-left[
__Code for Plotting:__

```r
ggplot(data = mo_shape_file_v2) +
  geom_sf(aes(fill = RPL_THEMES))
```
]

.pull-right[
__Map Output:__

<img src="data:image/png;base64,#~/SpatialData/DownloadedOutput/map1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Further Customizations

.pull-left[
__Code for Plotting:__

```r
# Added theme_void()
# to remove grid and grey background
ggplot(data = mo_shape_file_v2) +
  geom_sf(aes(fill = RPL_THEMES)) +
  theme_void()
```
]

.pull-right[
__Map Output:__

<img src="data:image/png;base64,#~/SpatialData/DownloadedOutput/map2.png" width="100%" style="display: block; margin: auto;" />
]

---

## Further Customizations Part 2

.pull-left[
__Code for Plotting:__

```r
# Further customizes labels and color gradient
ggplot(data = mo_shape_file_v2) +
  geom_sf(aes(fill = RPL_THEMES)) +
  theme_void() +
  scale_fill_gradient(low="#1fa187", high="#440154") +
  labs(fill='MO-Specific SVI')
```
]

.pull-right[
__Map Output:__

<img src="data:image/png;base64,#~/SpatialData/DownloadedOutput/map3.png" width="100%" style="display: block; margin: auto;" />
]

---

## tigris package shapefiles

In the previous example,
we used `tracts(state = "MO")` to get the tract-specific shapefile for MO.

Many other shapefiles are available. Two key examples:

- For state-level map: `states()`.
- For county-level map: `counties()`.

To the best of my knowledge, there are 40-50 shapefiles available (e.g., AIANNH [American Indian, Alaska Native and Native Hawaiian] boundaries, zip code tabulation area (ZCTA) boundaries).

__Speaking of ZCTA, a brief comment on ZCTA.__

---

## Detour: Zip Code Tabulation Area (ZCTA)

ACS datasets are also available at the ZCTA level.

ZCTAs might sound like the zip codes we use in our addresses, but they are not the same.

Most of the time, your postal zip code is the same as the ZCTA. Zip codes primarily used by the Postal Service for P.O. boxes will likely belong to a different ZCTA. The same is true for areas with few residential addresses (or areas that are primarily occupied by commercial businesses).

You may use a [Zip Code to ZCTA crosswalk file](https://udsmapper.org/zip-code-to-zcta-crosswalk/) created by UDS mapper. However, it might not be comprehensive or reflect ongoing changes to zip codes.

Please talk to a geographer before doing any comprehensive work with zip codes or ZCTAs.

---

# tidycensus package for mapping ACS data

If you want to map ACS-related data, the tidycensus package is the most convenient way to go.

tidycensus eliminates the need to merge data into a shapefile. __It can create a plotting-ready shapefile with the desired ACS data!__

</br>

__Task:__ Suppose we want to map the median % of household income spent on rent for each state using variable B25071_001.

---

# tidycensus package for mapping ACS data

__Task:__ Suppose we want to map the median % of household income spent on rent for each state using variable B25071_001.
.pull-left[
__What to run for state-level ACS5 data:__

```r
library(tidycensus)

census_api_key("YOUR CENSUS API KEY HERE")

shapefile_with_data <- get_acs(
  geography = "state",
  variables = "B25071_001",
  year = 2019,
  survey = "acs5",
  geometry = TRUE,
  shift_geo = TRUE
)
```
]

.pull-right[
__Note:__ The default survey for `get_acs()` is ACS5. I set it explicitly here for demonstration purposes.

__The key part here is:__ Make sure `geometry = TRUE`, as the default is FALSE. By setting geometry to TRUE, you are instructing `get_acs()` to return the final data as an sf object (shapefile) that is ready for map rendering via ggplot2.

`shift_geo = TRUE` is also important as it will compress the distance between the contiguous United States and Alaska, Hawaii, and Puerto Rico. In tidycensus 1.0 and later, `shift_geo` is deprecated in favor of `tigris::shift_geometry()`.
]

---

## View shapefile_with_data using head()

```r
library(tidycensus)

census_api_key("YOUR CENSUS API KEY HERE")

shapefile_with_data <- get_acs(
  geography = "state",
  variables = "B25071_001",
  year = 2019,
  survey = "acs5",
  geometry = TRUE,
  shift_geo = TRUE
)

head(shapefile_with_data)
```

__Let us quickly run this in R to see the output!__

---

## Rendering the map

.pull-left[
```r
library(ggplot2)

ggplot(data = shapefile_with_data) +
  geom_sf(aes(fill = estimate), color = NA) +
  theme_void() +
  labs(fill='Median Gross Rent as\na % of Household Income') +
  scale_fill_gradient(low="#1fa187", high="#440154") +
  theme(legend.position="bottom")
```
]

.pull-right[
<img src="data:image/png;base64,#~/SpatialData/DownloadedOutput/map4.png" width="100%" style="display: block; margin: auto;" />
]

---

## Importance of shifting geometry

I mentioned earlier that `shift_geo = TRUE` is important. Here's the map you'd generate without setting that argument to TRUE.

<img src="data:image/png;base64,#~/SpatialData/DownloadedOutput/ugly.png" width="80%" style="display: block; margin: auto;" />

---

## Conclusion

Thank you for attending today's spatial data and mapping workshop.

__What to do and learn next?__ Some thoughts.

Keep in touch. I mean this part!
__Email:__ For general questions, you may email me at `jfeliciano@aya.yale.edu`. For USDA-related questions on the work we do, you may email me at `josemari.feliciano@usda.gov`.

__Twitter:__ `@SeriFeliciano`

__LinkedIn:__ https://www.linkedin.com/in/jmtfeliciano/